This presentation reflects the views of the authors and should not be construed to represent the views or policies of the NIH
We often simplify our world, avoiding the grey
and want to associate this binary view of an outcome with other predictors we might be measuring
In other words, we're interested in the relationship between a binary outcome Y and a set of predictors X, denoting this by \(Y|X\)
We have been made to believe that
Binary outcomes = classification
and so find many algorithms for binary outcomes restricted to the classification problem
However, as long as we believe \(E(Y|X) = P(Y=1|X) = p\) is continuous, we can do
regression
logistic regression
In fact, if you recall, we write \[ \log(\frac{p}{1-p}) = \beta_0+\beta_1X_1+\beta_2X_2 \]
We then promptly dichotomize our predictions into 0-1 and look at misclassification rates
WHY!!!
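To make the point concrete, here is a minimal sketch (simulated data; all variable names are illustrative) showing that a logistic fit already yields probabilities directly, with no need to dichotomize:

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
p  <- plogis(-0.5 + 0.8 * x1 + 1.2 * x2)   # true P(Y = 1 | X)
y  <- rbinom(n, 1, p)

fit  <- glm(y ~ x1 + x2, family = binomial)
phat <- predict(fit, type = "response")     # probabilities, not 0/1 labels
head(phat)
```

The `phat` values live on \([0,1]\); collapsing them to 0/1 discards exactly the information we care about.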
Probability machines are learning machines for binary outcomes which are consistent, nonparametric estimators of the conditional probability \(P(Y=1|X)\)
Several candidates exist for probability machines
Biau and colleagues proved several consistency results for both random forests (RF) and k-nearest neighbors (kNN)
A PM has several advantages over classical regression methods
Let's compare probability machines to logistic regression, the industry standard
| Logistic regression | Probability machines |
|---|---|
| Assumes data arise from a logistic model | No such assumption |
| Explicit functional form | No such specification |
| Need to specify interactions | Interactions implicit |
| Requires fewer predictors than observations | Scalable to higher dimensions |
Some will argue that logistic regression is important to understand the effect of predictors.
This is coming up soon!
For now, let's look at prediction
We generate data from a logistic regression model with
We fit three models to the generated data
Main effects logistic regression
glm(y~x1+x2+x3+..., family=binomial)
Main effects + two-way interactions logistic regression
glm(y~(x1+x2+x3+...)^2, family=binomial)
Random forest regression
randomForest(y~x1+x2+x3+...)
For this entire exercise, we do not change this code
Start with data from a main effects model (ORs of 1.2, 1.7, 2.5)
LR1 = main effects logistic regression
LR2 = main effects+interaction logistic regression
RF = random forest probability machine
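A sketch of this setup, under assumed details (binary predictors, sample size, and seed are illustrative; the odds ratios 1.2, 1.7, 2.5 come from the slide):

```r
library(randomForest)
set.seed(42)
n  <- 1000
x1 <- rbinom(n, 1, 0.5); x2 <- rbinom(n, 1, 0.5); x3 <- rbinom(n, 1, 0.5)
lp <- log(1.2) * x1 + log(1.7) * x2 + log(2.5) * x3   # main-effects log-odds
y  <- rbinom(n, 1, plogis(lp))

lr1 <- glm(y ~ x1 + x2 + x3, family = binomial)       # LR1: main effects
lr2 <- glm(y ~ (x1 + x2 + x3)^2, family = binomial)   # LR2: + two-way interactions
rf  <- randomForest(y ~ x1 + x2 + x3)                 # RF: y numeric, so a regression forest
```

Because `y` is numeric 0/1, `randomForest` runs in regression mode and its predictions estimate \(P(Y=1|X)\) directly.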
Now add interactions (X1 x X2 = 2, X2 x X3 = 5)
LR1 = main effects logistic regression
LR2 = main effects+interaction logistic regression
RF = random forest probability machine
Now let's look at arbitrary probabilities for each (X1,X2,X3) combination
LR1 = main effects logistic regression
LR2 = main effects+interaction logistic regression
RF = random forest probability machine
Let's go back to your first regression course
"How much does the outcome change, on average, when a predictor changes by one unit, all other predictors remaining the same?"
This is based on the concept of counterfactuals
[Table: observed and counterfactual outcomes under X = 1 vs. X = 0]
The counterfactual argument is, in essence
If we put an observation in the other landscape, what would it do?
Now put each observation in each landscape and record its predicted outcome
Note, for each observation we now have a \(p_1\) and a \(p_0\)
Now we can compute conditional odds ratios using \[ OR = \frac{p_1(1-p_0)}{(1-p_1)p_0}\] for each observation, and look at group-specific odds ratios by averaging or taking medians
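A minimal sketch of the counterfactual step, assuming a fitted regression forest `rf` and a data frame `dat` with a binary predictor `x1` (all names illustrative):

```r
d1 <- d0 <- dat
d1$x1 <- 1                        # place every observation in the X1 = 1 landscape
d0$x1 <- 0                        # place every observation in the X1 = 0 landscape
p1 <- predict(rf, newdata = d1)   # counterfactual P(Y = 1) under X1 = 1
p0 <- predict(rf, newdata = d0)   # counterfactual P(Y = 1) under X1 = 0
or_i <- (p1 * (1 - p0)) / ((1 - p1) * p0)   # per-observation odds ratio
median(or_i)                                # one group-level summary
```

Averaging or taking medians of `or_i` within subgroups gives the group-specific odds ratios described above.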
Counterfactual and risk machines, and their applications, were developed by Jim Malley and Abhijit Dasgupta,
in collaboration with Joan Bailey-Wilson, Jason Moore and Silke Szymczak
Paper under final review at BioData Mining
Look at a main effects model
We have individual \(p_1\) and \(p_0\), so we can directly compute risk differences, risk ratios, and odds ratios
Since we have a way of estimating counterfactuals, estimating conditional interaction effects is straightforward
Make 4 machines to "capture landscapes" when \((X_1, X_2) = (1,1), (1,0), (0,1), (0,0)\)
Now compute the appropriate contrast (\(p_{11} - p_{10} - p_{01} + p_{00}\)) or ratio \[\frac{p_{11}(1-p_{10})}{p_{10}(1-p_{11})}/\frac{p_{01}(1-p_{00})}{p_{00}(1-p_{01})}\]
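One plausible reading of this recipe, sketched in R (the data frame `dat`, the remaining predictor `x3`, and the use of `randomForest` are all assumptions for illustration):

```r
library(randomForest)

# Fit one machine per (X1, X2) landscape, then let every observation visit it
land <- function(a, b) {
  sub <- subset(dat, x1 == a & x2 == b)      # observations living in this landscape
  m   <- randomForest(y ~ x3, data = sub)    # machine trained on the remaining predictor(s)
  predict(m, newdata = dat)                  # counterfactual P(Y = 1) for everyone
}
p11 <- land(1, 1); p10 <- land(1, 0); p01 <- land(0, 1); p00 <- land(0, 0)

contrast <- p11 - p10 - p01 + p00            # additive-scale interaction
ratio    <- (p11 * (1 - p10) / (p10 * (1 - p11))) /
            (p01 * (1 - p00) / (p00 * (1 - p01)))   # ratio of odds ratios
```

`contrast` and `ratio` are per-observation interaction effects; they can be summarized by group means or medians as before.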
The Risk Machine\(^{TM}\) is cumbersome when you have many features.
We can actually do a faster scan of the data to find 2-way interactions
We call it the Interactor\(^{TM}\)
We fit one PM to the data and get predicted probabilities
We can now create classical interaction plots either on natural or logit scale
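A short sketch of such a plot, assuming a fitted regression forest `rf` and binary predictors `x1`, `x2` in a data frame `dat` (names illustrative):

```r
phat <- predict(rf)                          # out-of-bag predicted probabilities
# Natural scale: mean predicted P(Y = 1) per (x1, x2) cell
interaction.plot(dat$x1, dat$x2, phat, ylab = "Predicted P(Y = 1)")
# Logit scale: same plot on qlogis(phat)
interaction.plot(dat$x1, dat$x2, qlogis(phat), ylab = "Predicted logit")
```

Non-parallel lines in either plot flag a candidate two-way interaction without refitting any models.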
Machines
Risk machine
Interactor
Propensity scores quantify the likelihood that individuals receive a treatment given other covariates
They are often used in observational studies to "level the playing field"
We have successfully used probability machines instead of logistic regression to generate propensity scores; this accommodates non-linear relationships better than logistic regression does
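A minimal sketch of this idea, under assumed names (`treat` is a 0/1 treatment indicator; `z1`, `z2` are illustrative covariates in `dat`):

```r
library(randomForest)
# Regression forest on the treatment indicator = nonparametric propensity score
ps <- predict(randomForest(treat ~ z1 + z2, data = dat))   # estimates P(treat = 1 | Z)
# One common use: inverse-probability-of-treatment weights
w  <- ifelse(dat$treat == 1, 1 / ps, 1 / (1 - ps))
```

The forest replaces only the propensity model; downstream matching or weighting proceeds as usual.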
Methods developed by Abhijit Dasgupta
Wasko, Dasgupta, Hubert, Fries, Ward (2013) Arthritis & Rheumatism
The basic idea
Ongoing consulting by Abhijit Dasgupta with various collaborators
The basic idea
Mojirsheibani (1999) proposed the idea of combining classifiers such that the collective provably does better than any individual machine in the collective
Note that this is not an ensemble method
The Paris group, inspired by Jim, has extended this to collectives of regression machines
Leverages work of Győrfi, Kohler, Krzyżak & Walk (GKKW) as well as Devroye, Győrfi & Lugosi (DGL)
R package forthcoming in Q1
Aurelie Fischer, Benjamin Guedj, Gerard Biau (Paris) inspired by Jim Malley
Paper under review at JRSS-B
The basic idea
Ongoing enterprise consulting and product development by Abhijit Dasgupta